Can we improve the odds of finding new fossil sites during palaeoanthropological reconnaissance? Much has changed since the early days of human palaeontology in Africa when miners uncovered hominin fossils from limestone breccias with explosive charges. Today, the use of military, topographic, and geological maps, aerial photos taken by planes or drones, and software like Google Earth or ArcGIS has become ubiquitous. These spatial datasets can aid in identifying high-priority areas prior to systematic surveying efforts with multidisciplinary teams (Njau and Hlusko, 2010). While there are still major fossil finds being attributed to serendipity, modern surveying is no longer just a question of being in the right place at the right time.
Geospatial palaeontology is a new and emergent discipline that uses remote sensing data, GIS and statistical algorithms to model the probability of finding new fossiliferous deposits (Anemone et al., 2011). Most of this research has been applied to identifying new Eocene deposits containing, among other vertebrates, early primates (Conroy, 2014; Emerson et al., 2015). However, such approaches have yet to be applied to regions known to be rich in hominin-bearing fossil deposits, such as the Omo-Turkana basin.
# The following are all essential packages you should install in R if you would like to run all the code contained in this notebook. The command to install packages is install.packages("nameofpackage")
library(raster)
library(maptools)
library(ggplot2)
library(zoo)
library(oneClass) # for svm
library(dismo) # for maxEnt
rangePlot <- function(x){(x - min(x))/(max(x) - min(x))} # function to standardize ranges used in plots
Various supervised learning algorithms for one-class classification are currently being applied to a dataset of Koobi Fora fossiliferous areas. The purpose of a one-class classifier is identical to the purpose of a supervised binary classifier. In contrast, however, to most standard classification problems, the training data of the one-class classifier only contains labelled samples from the positive class (here, fossil sites), while in binary classification, the negative class, has to be mapped in the training set. But collecting a representative training set for the negative class can be very costly and time-consuming due to the fact that the negative class is the aggregation of all other classes without the positive class (Mack and Waske, 2017). Thus, a one-class classifier is particularly useful when only one class has to be mapped and when the acquisition of representative labelled data for the negative class is expensive or complicated to define. Moreover, why should we assume a priori that other classes of landcover (e.g. shrubland, grassland, etc.) are unfossiliferous, instead of letting the model determine this from the data distribution? An early pipeline has already been developed using a subset of georeferenced fossils from The Turkana Database (Bobe et al., 2011).
We begin by loading a landsat8 scene in R, downloaded from USGS. For reproducibility we provide here its ID (LANDSAT_SCENE_ID = “LC81690572018036LGN00”) so others can download the same file and run all the analyses. We used a good quality, free-of-clouds (over Koobi Fora), and most recent satellite image available, representing a northeast section of lake Turkana (from 5th of February, 2018).
An imagem preview of LC81690572018036LGN00
# read landsat8 data
rawLandsat8 <- list.files("LANDSAT8/",
pattern = glob2rx("*B*.TIF$"),
full.names = TRUE)
stackLandsat8 <- stack(rawLandsat8) # stack all bands into a dataframe
extent(stackLandsat8) # verifies the extent of Landsat
## class : Extent
## xmin : 141285
## xmax : 369315
## ymin : 363285
## ymax : 595815
crs <- crs(stackLandsat8) # obtain coordinate reference system from landsat metadata
We begin by acquiring the coordinates for (individual) fossils. In the following chuck of code, we load coordinate data, proccess it, and give it a coordinate reference system matching data of the landsat8.
## obtain KML data from fossil sites known from Koob Fora
coords <- getKMLcoordinates('EastTurkanaFossilsCoords.kml', ignoreAltitude = TRUE)
## break down list object into bidimensional lon/lat-coordinates matrix
coords <- matrix(unlist(coords), ncol = 2, byrow = TRUE)
## convert it into SpatialPoints class object
coords <- SpatialPoints(unlist(coords), CRS('+init=epsg:4326'))
## use coordinate system defined above to convert the fossil sites coordinates to it
coords <- spTransform(coords, crs)
extent(coords)
## class : Extent
## xmin : 188952.2
## xmax : 215530.8
## ymin : 406463.5
## ymax : 478846.6
We can see that the area with known fossils is much smaller than the whole satellite image. Following we crop a subset of the landsat8 data to fit a viable window of analysis containing all known fossil sites. Then we rasterize the individual fossil coordinates to be indicative of “presence of fossil(s)” for each 30x30 cell in the grid. This will force the 1903 georeferenced fossils to be transformed into 1864 ‘presence’ datapoints. In this way, overlapping fossils in the same cell of the grid are only counted once, in order to avoid turning this into a count-regression poisson-ish problem (i.e. a nightmare).
# define our working area (east, west, south, north extent) based on intersection of previous extent() results
zone <- extent(188000, 216000, 406000, 480000)
croppedLandsat8 <- crop(stackLandsat8, extent(zone))
## are the coords gridded? Use croppedLandsat8 as template for rasterizing coords (by mean)
rcoo <- rasterize(coords, croppedLandsat8, fun = min)
Next step we just stack the different datasets into one object.
# get elevation, slope and aspect
source('LOADING_DEM.R') # another function simply to calculate variables from a SRTM file.
dem <- stack(areaDEM)
rm(areaDEM)
# adjust to same window/grid of raster objects
dem <- resample(dem, croppedLandsat8)
# concatenate the spectral bands and DEM variables
koobiFora <- stack(croppedLandsat8, dem)
We also add the fossil raster object created two steps ago into a final brick raster. Moreover we make simple numeric matrix clones of the created objects, since these will allow for faster computations (important for OCSVM).
rcoo <- resample(rcoo, koobiFora) # use koobi fora as template for coordinates resampling
rcoo <- reclassify(rcoo, c(0, 9999, 1)) # defines every value as 1 (presence of fossil[s]). Only NAs remaing as NA.
KF <- stack(koobiFora, rcoo)
KFdataset <- brick(KF) # turn multilayer band into a multilayer raster object
# To make computation faster, let's keep a numeric matrix version of this raster object:
nr <- getValues(KFdataset)
nr[,8] <- na.locf(nr[,8]) # DEM data has some empty values, this removes those
storage.mode(nr) <- "integer" # makes it faster
x <- nr[,1:8] # all the variables used to predict
y <- nr[,9] # target variable (i.e. fossil deposits)
Last we get some other fossil coordinates for digital validation of the algorithm outputs within Koobi Fora (vcoords). And we also prepare some objects for model validation and plotting.
## obtain coordinates without metadata associated to use as validation set
vcoords <- getKMLcoordinates('ValidationFossilsCoords.kml', ignoreAltitude = TRUE)
## convert list into bidimensional lon/lat-coordinates matrix
vcoords <- matrix(unlist(vcoords), ncol = 2, byrow = TRUE)
## convert it into SpatialPoints class object
vcoords <- SpatialPoints(unlist(vcoords), CRS('+init=epsg:4326'))
## use coordinate system defined above to convert the fossil sites coordinates to it
vcoords <- spTransform(vcoords, crs)
# create some usefull objects for plotting:
vcoo <- crop(vcoords, extent(KFdataset))
df.vcoo <- as.data.frame(vcoo@coords)
coo <- crop(coords, extent(KFdataset))
df.coo <- as.data.frame(coo@coords)
valCoo <- rasterize(vcoords, KFdataset, fun = mean)
valCoo[!is.na(values(valCoo$layer))] <- 1
valCoo[is.na(values(valCoo$layer))] <- 0
valCoo <- as.factor(valCoo)
allCoo <- rbind(coords, vcoords)
allCoo <- rasterize(allCoo, KFdataset, fun = mean)
allCoo[!is.na(values(allCoo$layer))] <- 1
allCoo[is.na(values(allCoo$layer))] <- 0
allCoo <- as.factor(allCoo)
df.coo$`Fossil sites` <- 'Used for modelling'
df.vcoo$`Fossil sites` <- 'Validation set'
df.coords <- rbind(df.coo, df.vcoo)
On a raster object of 2072 km2 (pixel resolution: 30x30m) containing 7 spectral bands (ultraviolet, blue, green, red and 3 infrared) of LANDSAT 8 and a digital elevation model (from SRTM), we trained a one-class support vector machines (OCSVM) and a maximum entropy (maxEnt) algorithms to detect 1864 cells known to be fossiliferous. This is the very first implementation of one-class classifiers for remote fossil detection in Africa (first plot, jcoelho.com/images/Figure1.png). The final step is to validate or groundtruth these predictions in the field.
Fitting one class svm.
y_caret <- y
y_caret[is.na(y_caret)] <- 0
tr.y <- puFactor(y_caret, positive = 1)
rm(y_caret)
set.seed(123456)
tr.index <- createFolds(tr.y, k = 2, returnTrain = TRUE)
tr.tune <- expand.grid(sigma = c(0.01, 1), nu = c(0.01, 0.1))
ocsvm.fit <- trainOcc(x = x, y = tr.y, method = "ocsvm", index = tr.index,
tuneGrid = tr.tune, allowParallel = TRUE)
## + Fold1: sigma=0.01, nu=0.01
## - Fold1: sigma=0.01, nu=0.01
## + Fold1: sigma=1.00, nu=0.01
## - Fold1: sigma=1.00, nu=0.01
## + Fold1: sigma=0.01, nu=0.10
## - Fold1: sigma=0.01, nu=0.10
## + Fold1: sigma=1.00, nu=0.10
## - Fold1: sigma=1.00, nu=0.10
## + Fold2: sigma=0.01, nu=0.01
## - Fold2: sigma=0.01, nu=0.01
## + Fold2: sigma=1.00, nu=0.01
## - Fold2: sigma=1.00, nu=0.01
## + Fold2: sigma=0.01, nu=0.10
## - Fold2: sigma=0.01, nu=0.10
## + Fold2: sigma=1.00, nu=0.10
## - Fold2: sigma=1.00, nu=0.10
## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 1, nu = 0.1 on full training set
Predictions for the whole region are created (plus, a plot to visualize it).
ocsvm.pred <- predict(ocsvm.fit, x) # create predictions
numpredSVM <- as.matrix(as.numeric(ocsvm.pred))
classcover <- raster(numpredSVM, crs = crs(KFdataset), template = KFdataset)
class_spdf <- as(classcover, "SpatialPixelsDataFrame")
class_df <- as.data.frame(class_spdf)
colnames(class_df) <- c("value", "longitude", "latitude")
class_df$value <- rangePlot(class_df$value)
ggplot() +
geom_tile(data = class_df, aes(longitude, latitude, fill = value)) +
viridis::scale_fill_viridis() + coord_equal() +
geom_point(data = df.coords, aes(coords.x1, coords.x2, color = `Fossil sites`, shape = `Fossil sites`,
size = `Fossil sites`), alpha = 0.5) +
scale_colour_manual(values = c('#ff00de', 'red')) +
scale_shape_manual(values = c(21, 22)) +
scale_size_manual(values = c(1, 0.8)) +
ggtitle("OCSVM (oneClass, prob. density, the closest to 1, the more likely to be fossiliferous)") +
theme_minimal() + labs(fill = 'Prediction')
Confusion matrix results for ALL DATA INSTANCES.
ocsvm.CM <- ocsvm.pred
qv <- quantile(ocsvm.CM, 0.75) # choose a cut-off, or threshold value. here top 25% predictions...
ocsvm.CM[ocsvm.pred < qv] <- "0"
ocsvm.CM[ocsvm.pred >= qv] <- "1"
table(ocsvm.CM)
## ocsvm.CM
## 0 1
## 1728133 576045
ocsvm.CM <- as.factor(ocsvm.CM)
allfossils <- as.factor(getValues(allCoo)) # using all instances
confusionMatrix(ocsvm.CM, allfossils, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1727945 188
## 1 574108 1937
##
## Accuracy : 0.7508
## 95% CI : (0.7502, 0.7513)
## No Information Rate : 0.9991
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0049
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9115294
## Specificity : 0.7506104
## Pos Pred Value : 0.0033626
## Neg Pred Value : 0.9998912
## Prevalence : 0.0009222
## Detection Rate : 0.0008406
## Detection Prevalence : 0.2500002
## Balanced Accuracy : 0.8310699
##
## 'Positive' Class : 1
##
validation <- as.factor(getValues(valCoo)) # using only separated valdiation set
confusionMatrix(ocsvm.CM, validation, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1728037 96
## 1 575663 382
##
## Accuracy : 0.7501
## 95% CI : (0.7496, 0.7507)
## No Information Rate : 0.9998
## P-Value [Acc > NIR] : 1
##
## Kappa : 9e-04
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7991632
## Specificity : 0.7501137
## Pos Pred Value : 0.0006631
## Neg Pred Value : 0.9999444
## Prevalence : 0.0002074
## Detection Rate : 0.0001658
## Detection Prevalence : 0.2500002
## Balanced Accuracy : 0.7746385
##
## 'Positive' Class : 1
##
Confusion matrix results ONLY for VALIDATION SET (not used for training the algorithm).
#
The final trained OCSVM model resulted in a balanced accuracy of 83.30% and a true positive rate (TPR) of 90.92% for all known georeferenced fossiliferous areas in Koobi Fora. Even when tested in a separated validation set (i.e. not used during model training) containing 478 fossiliferous cells (each 30x30m), the TPR was 80%, as it correctly detected 380 of these cells in the grid (second plot, Figure1). With an extremely high negative predictive value (= 99.99%), this model is particularly good at telling palaeoanthropologists where they should NOT expend surveying effort, a desirable goal for cost-effective fieldwork. This useful information can be further enhanced by increasing the cut-off value of the predictive class to a higher degree of confidence (e.g. >95%; last plot, see here). NOTE: This metrics are all problematic for ONE-CLASS problems and should not be read the same way as in other classification problems. Still there are currently no other solutions…
Since the plots are too big to be explored here, please open the following image in another tab: OCSVMresults
A very popular model in ecology is maxEnt. Studies show that maxEnt predictions compare favorably to presence/pseudo-absence models when tested against real pres/abs data (Elith et al 2006). Another advantage is that it can model multiple species, but that is not our objective here. Here we are modelling “fossil sites” as if they were analagous to a single taxon in niche ecology studies.
We start to fit a maxent model to predict the coords (our y variable) from koobiFora (our x variables).
maxent.fit <- maxent(koobiFora, coords)
## Loading required namespace: rJava
We then use the model just generated in the chuck of code above to predict in our exentent area (same used for training) the probability of each cell in the grid to contain fossiliferous deposits.
maxent.pred <- predict(maxent.fit, x)
Confusion matrix for validation data:
Confusion matrix results for ALL DATA INSTANCES.
maxent.CM <- maxent.pred
maxent.CM <- rangePlot(maxent.CM)
maxent.CM[maxent.pred < 0.5] <- "0"
maxent.CM[maxent.pred >= 0.5] <- "1"
maxent.CM <- as.factor(maxent.CM)
confusionMatrix(maxent.CM, allfossils, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2035901 459
## 1 266152 1666
##
## Accuracy : 0.8843
## 95% CI : (0.8839, 0.8847)
## No Information Rate : 0.9991
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0105
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7840000
## Specificity : 0.8843849
## Pos Pred Value : 0.0062206
## Neg Pred Value : 0.9997746
## Prevalence : 0.0009222
## Detection Rate : 0.0007230
## Detection Prevalence : 0.1162315
## Balanced Accuracy : 0.8341925
##
## 'Positive' Class : 1
##
Confusion matrix results ONLY for VALIDATION SET (not used for training the algorithm).
confusionMatrix(maxent.CM, validation, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2036241 119
## 1 267459 359
##
## Accuracy : 0.8839
## 95% CI : (0.8835, 0.8843)
## No Information Rate : 0.9998
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0023
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7510460
## Specificity : 0.8839002
## Pos Pred Value : 0.0013405
## Neg Pred Value : 0.9999416
## Prevalence : 0.0002074
## Detection Rate : 0.0001558
## Detection Prevalence : 0.1162315
## Balanced Accuracy : 0.8174731
##
## 'Positive' Class : 1
##
After this intensive computation we can then generate some plots to better understand the results:
numpredME <- as.matrix(as.numeric(maxent.pred))
classcoverME <- raster(numpredME, crs = crs(KFdataset), template = KFdataset)
class_spdfME <- as(classcoverME, "SpatialPixelsDataFrame")
class_dfME <- as.data.frame(class_spdfME)
colnames(class_dfME) <- c("value", "longitude", "latitude")
class_dfME$value <- rangePlot(class_dfME$value)
ggplot() +
geom_raster(data = class_dfME, aes(longitude, latitude, fill = value)) +
viridis::scale_fill_viridis() + coord_equal() +
geom_point(data = df.coords, aes(coords.x1, coords.x2, color = `Fossil sites`,
shape = `Fossil sites`,
size = `Fossil sites`), alpha = 0.5) +
scale_colour_manual(values = c('#ff00de', 'red')) +
scale_shape_manual(values = c(21, 22)) +
scale_size_manual(values = c(1, 0.8)) +
ggtitle("maxEnt (probability density, the closest to 1, the more likely to be fossiliferous)") +
theme_minimal() + labs(fill = 'Prediction')
binary_df <- class_dfME
summary(binary_df)
## value longitude latitude
## Min. :0.00000 Min. :188010 Min. :406020
## 1st Qu.:0.01271 1st Qu.:195000 1st Qu.:424500
## Median :0.08946 Median :202005 Median :443010
## Mean :0.17909 Mean :202005 Mean :443010
## 3rd Qu.:0.24822 3rd Qu.:209010 3rd Qu.:461520
## Max. :1.00000 Max. :216000 Max. :480000
binary_df$value <- cut(binary_df$value, 2, labels = c('unfossiliferous', 'fossiliferous'))
table(binary_df$value)
##
## unfossiliferous fossiliferous
## 2036358 267820
my_palette <- c(rev(viridis::viridis(n = length(levels(binary_df$value)))))
ggplot() +
geom_raster(data = binary_df, aes(longitude, latitude, fill = as.factor(value) )) +
coord_equal() + scale_fill_manual(values = my_palette) +
geom_point(data = df.coords, aes(coords.x1, coords.x2, color = `Fossil sites`, shape = `Fossil sites`,
size = `Fossil sites`), alpha = 0.5) +
scale_colour_manual(values = c('#ff00de', 'red')) +
scale_shape_manual(values = c(21, 22)) +
scale_size_manual(values = c(1, 0.8)) +
ggtitle("maxEnt (50% threshold)") +
theme_minimal() + labs(fill = 'Prediction')
So maxEnt is not guessing correctly as many fossil sites as OCSVM, but is giving us a way smaller region to explore (which is the final objective, precision).
Bigger versions of these two plots can be found here, and here.
A poster presentation of the digital framework being currently developed and just explained above will be submitted to the Paleoanthropology Society 2019 Meeting (Albuquerque, New Mexico). Then, after groundtruthing our model at Koobi Fora, I will prepare a podium presentation for the ESHE meeting 2019 with all the field results. Since I will have digital and groundtruthed results, I am aiming for at least two papers in international scientific journals for 2019. These will compare the 2 different approaches tested for one-class classfication: OCSVM and maxEnt, and possibly others if successfully implemented before going to the field.
Ultimately, the major contribution of this project is to enhance the palaeoanthropological tool kit for surveying and guiding professionals directly to more productive localities. In order to achieve that, models will have to be developed with very high rates of accuracy in a myriad of different modern ecological settings. In Turkana we will extend and generalize the approaches already proven successful in Gorongosa.